Life Expectancy

The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of health status and many related factors for all countries. The datasets are made available to the public for health data analysis. The dataset on life expectancy and health factors for 193 countries was collected from the WHO data repository website, and the corresponding economic data was collected from the United Nations website. Among all categories of health-related factors, only the most representative critical factors were chosen. Over the past 15 years, the health sector has developed considerably compared with the past 30 years, resulting in improved human mortality rates, especially in developing nations. Therefore, this project considers data from 2000 to 2015 for 193 countries.

The individual data files were merged into a single dataset. Initial visual inspection of the data showed some missing values. As the datasets were from WHO, we found no evident errors. Missing data was examined in R using the missmap command. The result indicated that most of the missing data was for population, Hepatitis B and GDP. The missing data came from lesser-known countries such as Vanuatu, Tonga, Togo and Cabo Verde. Finding all data for these countries was difficult, so it was decided to exclude these countries from the final model dataset. The final merged file (final dataset) consists of 22 columns and 2938 rows, which means 20 predicting variables. The predicting variables were then divided into several broad categories: immunization-related factors, mortality factors, economic factors and social factors.

This dataset was sourced from Kaggle: https://www.kaggle.com/kumarajarshi/life-expectancy-who

Exploratory Data Analysis

library(tidyverse)
library(plotly)
#Reading csv file
life_expectancy_data <- read.csv("C:/Users/shah3sw/OneDrive - University of Cincinnati/Data_Analysis_Method_Project/Life Expectancy Data.csv")
head(life_expectancy_data)
##       Country Year     Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing            65.0             263            62
## 2 Afghanistan 2014 Developing            59.9             271            64
## 3 Afghanistan 2013 Developing            59.9             268            66
## 4 Afghanistan 2012 Developing            59.5             272            69
## 5 Afghanistan 2011 Developing            59.2             275            71
## 6 Afghanistan 2010 Developing            58.8             279            74
##   Alcohol percentage.expenditure Hepatitis.B Measles  BMI under.five.deaths
## 1    0.01              71.279624          65    1154 19.1                83
## 2    0.01              73.523582          62     492 18.6                86
## 3    0.01              73.219243          64     430 18.1                89
## 4    0.01              78.184215          67    2787 17.6                93
## 5    0.01               7.097109          68    3013 17.2                97
## 6    0.01              79.679367          66    1989 16.7               102
##   Polio Total.expenditure Diphtheria HIV.AIDS       GDP Population
## 1     6              8.16         65      0.1 584.25921   33736494
## 2    58              8.18         62      0.1 612.69651     327582
## 3    62              8.13         64      0.1 631.74498   31731688
## 4    67              8.52         67      0.1 669.95900    3696958
## 5    68              7.87         68      0.1  63.53723    2978599
## 6    66              9.20         66      0.1 553.32894    2883167
##   thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1                 17.2               17.3                           0.479
## 2                 17.5               17.5                           0.476
## 3                 17.7               17.7                           0.470
## 4                 17.9               18.0                           0.463
## 5                 18.2               18.2                           0.454
## 6                 18.4               18.4                           0.448
##   Schooling
## 1      10.1
## 2      10.0
## 3       9.9
## 4       9.8
## 5       9.5
## 6       9.2
#Dimensions : Gives numbers of rows and columns
dim(life_expectancy_data)
## [1] 2938   22
# Structure of dataset
str(life_expectancy_data)
## 'data.frame':    2938 obs. of  22 variables:
##  $ Country                        : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Year                           : int  2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
##  $ Status                         : chr  "Developing" "Developing" "Developing" "Developing" ...
##  $ Life.expectancy                : num  65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
##  $ Adult.Mortality                : int  263 271 268 272 275 279 281 287 295 295 ...
##  $ infant.deaths                  : int  62 64 66 69 71 74 77 80 82 84 ...
##  $ Alcohol                        : num  0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
##  $ percentage.expenditure         : num  71.3 73.5 73.2 78.2 7.1 ...
##  $ Hepatitis.B                    : int  65 62 64 67 68 66 63 64 63 64 ...
##  $ Measles                        : int  1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
##  $ BMI                            : num  19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
##  $ under.five.deaths              : int  83 86 89 93 97 102 106 110 113 116 ...
##  $ Polio                          : int  6 58 62 67 68 66 63 64 63 58 ...
##  $ Total.expenditure              : num  8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
##  $ Diphtheria                     : int  65 62 64 67 68 66 63 64 63 58 ...
##  $ HIV.AIDS                       : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ GDP                            : num  584.3 612.7 631.7 670 63.5 ...
##  $ Population                     : num  33736494 327582 31731688 3696958 2978599 ...
##  $ thinness..1.19.years           : num  17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
##  $ thinness.5.9.years             : num  17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
##  $ Income.composition.of.resources: num  0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
##  $ Schooling                      : num  10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...
#Summary
summary(life_expectancy_data)
##    Country               Year         Status          Life.expectancy
##  Length:2938        Min.   :2000   Length:2938        Min.   :36.30  
##  Class :character   1st Qu.:2004   Class :character   1st Qu.:63.10  
##  Mode  :character   Median :2008   Mode  :character   Median :72.10  
##                     Mean   :2008                      Mean   :69.22  
##                     3rd Qu.:2012                      3rd Qu.:75.70  
##                     Max.   :2015                      Max.   :89.00  
##                                                       NA's   :10     
##  Adult.Mortality infant.deaths       Alcohol        percentage.expenditure
##  Min.   :  1.0   Min.   :   0.0   Min.   : 0.0100   Min.   :    0.000     
##  1st Qu.: 74.0   1st Qu.:   0.0   1st Qu.: 0.8775   1st Qu.:    4.685     
##  Median :144.0   Median :   3.0   Median : 3.7550   Median :   64.913     
##  Mean   :164.8   Mean   :  30.3   Mean   : 4.6029   Mean   :  738.251     
##  3rd Qu.:228.0   3rd Qu.:  22.0   3rd Qu.: 7.7025   3rd Qu.:  441.534     
##  Max.   :723.0   Max.   :1800.0   Max.   :17.8700   Max.   :19479.912     
##  NA's   :10                       NA's   :194                             
##   Hepatitis.B       Measles              BMI        under.five.deaths
##  Min.   : 1.00   Min.   :     0.0   Min.   : 1.00   Min.   :   0.00  
##  1st Qu.:77.00   1st Qu.:     0.0   1st Qu.:19.30   1st Qu.:   0.00  
##  Median :92.00   Median :    17.0   Median :43.50   Median :   4.00  
##  Mean   :80.94   Mean   :  2419.6   Mean   :38.32   Mean   :  42.04  
##  3rd Qu.:97.00   3rd Qu.:   360.2   3rd Qu.:56.20   3rd Qu.:  28.00  
##  Max.   :99.00   Max.   :212183.0   Max.   :87.30   Max.   :2500.00  
##  NA's   :553                        NA's   :34                       
##      Polio       Total.expenditure   Diphtheria       HIV.AIDS     
##  Min.   : 3.00   Min.   : 0.370    Min.   : 2.00   Min.   : 0.100  
##  1st Qu.:78.00   1st Qu.: 4.260    1st Qu.:78.00   1st Qu.: 0.100  
##  Median :93.00   Median : 5.755    Median :93.00   Median : 0.100  
##  Mean   :82.55   Mean   : 5.938    Mean   :82.32   Mean   : 1.742  
##  3rd Qu.:97.00   3rd Qu.: 7.492    3rd Qu.:97.00   3rd Qu.: 0.800  
##  Max.   :99.00   Max.   :17.600    Max.   :99.00   Max.   :50.600  
##  NA's   :19      NA's   :226       NA's   :19                      
##       GDP              Population        thinness..1.19.years
##  Min.   :     1.68   Min.   :3.400e+01   Min.   : 0.10       
##  1st Qu.:   463.94   1st Qu.:1.958e+05   1st Qu.: 1.60       
##  Median :  1766.95   Median :1.387e+06   Median : 3.30       
##  Mean   :  7483.16   Mean   :1.275e+07   Mean   : 4.84       
##  3rd Qu.:  5910.81   3rd Qu.:7.420e+06   3rd Qu.: 7.20       
##  Max.   :119172.74   Max.   :1.294e+09   Max.   :27.70       
##  NA's   :448         NA's   :652         NA's   :34          
##  thinness.5.9.years Income.composition.of.resources   Schooling    
##  Min.   : 0.10      Min.   :0.0000                  Min.   : 0.00  
##  1st Qu.: 1.50      1st Qu.:0.4930                  1st Qu.:10.10  
##  Median : 3.30      Median :0.6770                  Median :12.30  
##  Mean   : 4.87      Mean   :0.6276                  Mean   :11.99  
##  3rd Qu.: 7.20      3rd Qu.:0.7790                  3rd Qu.:14.30  
##  Max.   :28.60      Max.   :0.9480                  Max.   :20.70  
##  NA's   :34         NA's   :167                     NA's   :163
#Check for missing values
colSums(is.na(life_expectancy_data))
##                         Country                            Year 
##                               0                               0 
##                          Status                 Life.expectancy 
##                               0                              10 
##                 Adult.Mortality                   infant.deaths 
##                              10                               0 
##                         Alcohol          percentage.expenditure 
##                             194                               0 
##                     Hepatitis.B                         Measles 
##                             553                               0 
##                             BMI               under.five.deaths 
##                              34                               0 
##                           Polio               Total.expenditure 
##                              19                             226 
##                      Diphtheria                        HIV.AIDS 
##                              19                               0 
##                             GDP                      Population 
##                             448                             652 
##            thinness..1.19.years              thinness.5.9.years 
##                              34                              34 
## Income.composition.of.resources                       Schooling 
##                             167                             163

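The number and share of missing values in each variable (out of 2938 rows) are:

  1. BMI - 34 (1.2%)
  2. Hepatitis B - 553 (18.8%)
  3. Alcohol - 194 (6.6%)
  4. Total expenditure - 226 (7.7%)
  5. Diphtheria - 19 (0.6%)
  6. GDP - 448 (15.2%)
  7. Population - 652 (22.2%)
  8. Income composition of resources - 167 (5.7%)
  9. Schooling - 163 (5.5%)
  10. Thinness 1-19 years - 34 (1.2%)
  11. Thinness 5-9 years - 34 (1.2%)
  12. Life expectancy - 10 (0.3%)
  13. Adult mortality - 10 (0.3%)
  14. Polio - 19 (0.6%)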
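
The counts and percentages in a list like this can be computed directly rather than by hand. The following is a minimal sketch in base R on a small made-up data frame (`toy` and its columns are hypothetical stand-ins, not the real dataset); the same expressions should work on `life_expectancy_data`.

```r
# Hypothetical toy data standing in for life_expectancy_data
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 1, 2),
  c = 1:4
)

# Count and percentage of missing values per column
missing_summary <- data.frame(
  variable    = names(toy),
  n_missing   = colSums(is.na(toy)),
  pct_missing = round(100 * colMeans(is.na(toy)), 1),
  row.names   = NULL
)

# Keep only columns with at least one NA, most affected first
missing_summary <- missing_summary[missing_summary$n_missing > 0, ]
missing_summary <- missing_summary[order(-missing_summary$n_missing), ]
print(missing_summary)
```

Because `colMeans(is.na(...))` divides by the number of rows, the percentages stay consistent with the counts automatically.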

We replace missing values with the column mean so that missing entries do not cause errors in our analysis.

# Select numeric variables for calculating mean
life_expectancy_data_num <- select(life_expectancy_data,-c(1,2,3))

#Calculate means of all the numeric variables
colMeans(life_expectancy_data_num, na.rm = TRUE)
##                 Life.expectancy                 Adult.Mortality 
##                    6.922493e+01                    1.647964e+02 
##                   infant.deaths                         Alcohol 
##                    3.030395e+01                    4.602861e+00 
##          percentage.expenditure                     Hepatitis.B 
##                    7.382513e+02                    8.094046e+01 
##                         Measles                             BMI 
##                    2.419592e+03                    3.832125e+01 
##               under.five.deaths                           Polio 
##                    4.203574e+01                    8.255019e+01 
##               Total.expenditure                      Diphtheria 
##                    5.938190e+00                    8.232408e+01 
##                        HIV.AIDS                             GDP 
##                    1.742103e+00                    7.483158e+03 
##                      Population            thinness..1.19.years 
##                    1.275338e+07                    4.839704e+00 
##              thinness.5.9.years Income.composition.of.resources 
##                    4.870317e+00                    6.275511e-01 
##                       Schooling 
##                    1.199279e+01
# Impute missing values in numeric variables with mean
for(i in 4:ncol(life_expectancy_data)) {
  life_expectancy_data[ , i][is.na(life_expectancy_data[ , i])] <- mean(life_expectancy_data[ , i], na.rm=TRUE)
}
summary(life_expectancy_data) 
##    Country               Year         Status          Life.expectancy
##  Length:2938        Min.   :2000   Length:2938        Min.   :36.30  
##  Class :character   1st Qu.:2004   Class :character   1st Qu.:63.20  
##  Mode  :character   Median :2008   Mode  :character   Median :72.00  
##                     Mean   :2008                      Mean   :69.22  
##                     3rd Qu.:2012                      3rd Qu.:75.60  
##                     Max.   :2015                      Max.   :89.00  
##  Adult.Mortality infant.deaths       Alcohol       percentage.expenditure
##  Min.   :  1.0   Min.   :   0.0   Min.   : 0.010   Min.   :    0.000     
##  1st Qu.: 74.0   1st Qu.:   0.0   1st Qu.: 1.093   1st Qu.:    4.685     
##  Median :144.0   Median :   3.0   Median : 4.160   Median :   64.913     
##  Mean   :164.8   Mean   :  30.3   Mean   : 4.603   Mean   :  738.251     
##  3rd Qu.:227.0   3rd Qu.:  22.0   3rd Qu.: 7.390   3rd Qu.:  441.534     
##  Max.   :723.0   Max.   :1800.0   Max.   :17.870   Max.   :19479.912     
##   Hepatitis.B       Measles              BMI        under.five.deaths
##  Min.   : 1.00   Min.   :     0.0   Min.   : 1.00   Min.   :   0.00  
##  1st Qu.:80.94   1st Qu.:     0.0   1st Qu.:19.40   1st Qu.:   0.00  
##  Median :87.00   Median :    17.0   Median :43.00   Median :   4.00  
##  Mean   :80.94   Mean   :  2419.6   Mean   :38.32   Mean   :  42.04  
##  3rd Qu.:96.00   3rd Qu.:   360.2   3rd Qu.:56.10   3rd Qu.:  28.00  
##  Max.   :99.00   Max.   :212183.0   Max.   :87.30   Max.   :2500.00  
##      Polio       Total.expenditure   Diphtheria       HIV.AIDS     
##  Min.   : 3.00   Min.   : 0.370    Min.   : 2.00   Min.   : 0.100  
##  1st Qu.:78.00   1st Qu.: 4.370    1st Qu.:78.00   1st Qu.: 0.100  
##  Median :93.00   Median : 5.938    Median :93.00   Median : 0.100  
##  Mean   :82.55   Mean   : 5.938    Mean   :82.32   Mean   : 1.742  
##  3rd Qu.:97.00   3rd Qu.: 7.330    3rd Qu.:97.00   3rd Qu.: 0.800  
##  Max.   :99.00   Max.   :17.600    Max.   :99.00   Max.   :50.600  
##       GDP              Population        thinness..1.19.years
##  Min.   :     1.68   Min.   :3.400e+01   Min.   : 0.10       
##  1st Qu.:   580.49   1st Qu.:4.189e+05   1st Qu.: 1.60       
##  Median :  3116.56   Median :3.676e+06   Median : 3.40       
##  Mean   :  7483.16   Mean   :1.275e+07   Mean   : 4.84       
##  3rd Qu.:  7483.16   3rd Qu.:1.275e+07   3rd Qu.: 7.10       
##  Max.   :119172.74   Max.   :1.294e+09   Max.   :27.70       
##  thinness.5.9.years Income.composition.of.resources   Schooling    
##  Min.   : 0.10      Min.   :0.0000                  Min.   : 0.00  
##  1st Qu.: 1.60      1st Qu.:0.5042                  1st Qu.:10.30  
##  Median : 3.40      Median :0.6620                  Median :12.10  
##  Mean   : 4.87      Mean   :0.6276                  Mean   :11.99  
##  3rd Qu.: 7.20      3rd Qu.:0.7720                  3rd Qu.:14.10  
##  Max.   :28.60      Max.   :0.9480                  Max.   :20.70
# We can see that now the data set has no missing values
colSums(is.na(life_expectancy_data))
##                         Country                            Year 
##                               0                               0 
##                          Status                 Life.expectancy 
##                               0                               0 
##                 Adult.Mortality                   infant.deaths 
##                               0                               0 
##                         Alcohol          percentage.expenditure 
##                               0                               0 
##                     Hepatitis.B                         Measles 
##                               0                               0 
##                             BMI               under.five.deaths 
##                               0                               0 
##                           Polio               Total.expenditure 
##                               0                               0 
##                      Diphtheria                        HIV.AIDS 
##                               0                               0 
##                             GDP                      Population 
##                               0                               0 
##            thinness..1.19.years              thinness.5.9.years 
##                               0                               0 
## Income.composition.of.resources                       Schooling 
##                               0                               0
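
The same mean imputation can be expressed without hard-coding column positions. This is only a sketch on made-up data (`toy` and `impute_mean` are hypothetical names introduced for illustration); it selects columns by type instead of by index.

```r
# Hypothetical helper: replace NA entries of a numeric vector with its mean
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Hypothetical toy data with one character column and two numeric columns
toy <- data.frame(
  id = c("a", "b", "c"),
  x  = c(1, NA, 3),
  y  = c(NA, 10, 20),
  stringsAsFactors = FALSE
)

# Apply the helper to numeric columns only, leaving character columns untouched
num_cols <- sapply(toy, is.numeric)
toy[num_cols] <- lapply(toy[num_cols], impute_mean)
print(toy)
```

Selecting by `is.numeric` also covers an integer column such as `Year`, which is harmless when that column has no missing values.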

When predicting life expectancy, there may be a few outliers that we need to ignore.

#Plotting box plots of life expectancy to understand outliers
boxplot(life_expectancy_data$Life.expectancy, xlab="Life Expectancy")

From the box plot we can see that life expectancy values below 45 years are outliers. Our analysis is not applicable to these records.
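
The points a box plot draws as outliers can also be listed numerically. The sketch below uses `boxplot.stats()` from base R on a made-up vector (`x` is hypothetical); it applies the usual 1.5 × IQR whisker rule, which may not give exactly the cutoff read off the plot above.

```r
# Hypothetical life expectancy values with one unusually low record
x <- c(70, 72, 68, 75, 71, 69, 74, 73, 40)

# boxplot.stats() returns the values lying beyond the whiskers in $out
outliers <- boxplot.stats(x)$out
print(outliers)

# Keep only the records that are not flagged as outliers
x_clean <- x[!x %in% outliers]
print(length(x_clean))
```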

Linear Regression

Now we will perform linear regression to identify how each factor contributes to the life expectancy of a person.

Let’s start with the field “Percentage Expenditure”

1. Life expectancy vs Percentage expenditure

Percentage Expenditure represents expenditure on health as a percentage of Gross Domestic Product per capita (%).

First, let's find the correlation between Percentage Expenditure and Life Expectancy.

# Correlation between life expectancy and percentage expenditure
cor(life_expectancy_data$Life.expectancy, life_expectancy_data$percentage.expenditure)
## [1] 0.3817912
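# Note: this formula uses percentage.expenditure as the response and Life.expectancy as the predictor
model_per_expenditure <- lm(percentage.expenditure ~ Life.expectancy, life_expectancy_data)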
summary(model_per_expenditure)
## 
## Call:
## lm(formula = percentage.expenditure ~ Life.expectancy, data = life_expectancy_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2252.1  -940.3  -433.9   274.3 17626.1 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -4787.782    249.204  -19.21   <2e-16 ***
## Life.expectancy    79.827      3.566   22.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1838 on 2936 degrees of freedom
## Multiple R-squared:  0.1458, Adjusted R-squared:  0.1455 
## F-statistic:   501 on 1 and 2936 DF,  p-value: < 2.2e-16

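The value of 0.3817912 indicates a moderate positive correlation between percentage expenditure and life expectancy. The estimated coefficient for Life.expectancy is statistically significant, as the associated p-value is less than 0.05. The R-squared of 0.1458 means the model explains about 14.6% of the variance.

Because the model was fitted with percentage expenditure as the response, the interpretation is that each additional year of life expectancy is associated with an increase of about 79.827 units in percentage expenditure. To estimate how life expectancy changes with percentage expenditure, the formula would need to be reversed to Life.expectancy ~ percentage.expenditure.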
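
To make the direction of the formula concrete, here is a small sketch on made-up numbers (`toy`, `expenditure` and `life_exp` are hypothetical, not values from the dataset). In R's `lm()`, the variable on the left of `~` is the response, so the slope is read as the change in that variable per one-unit change in the predictor.

```r
# Hypothetical toy data: health expenditure and life expectancy
toy <- data.frame(
  expenditure = c(0, 100, 200, 300, 400),
  life_exp    = c(60, 63, 63, 67, 67)
)

# Response on the left, predictor on the right
fit <- lm(life_exp ~ expenditure, data = toy)

# Slope: change in life_exp (years) per one-unit increase in expenditure
slope <- coef(fit)[["expenditure"]]
print(slope)

# For a single predictor, R-squared equals the squared correlation
r2 <- summary(fit)$r.squared
print(c(r2, cor(toy$expenditure, toy$life_exp)^2))
```

The second check also shows why swapping response and predictor changes the slope but not the R-squared.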

library(plotly)
# Scatter plot of life expectancy against percentage expenditure
life_expectancy_vs_percentage_expenditure <- plot_ly(
  data = life_expectancy_data,
  x = ~percentage.expenditure, y = ~Life.expectancy,
  type = 'scatter', mode = 'markers',  # set explicitly so plotly does not have to guess
  marker = list(size = 10,
                color = 'rgba(0, 255, 127, .9)',
                line = list(color = 'rgba(255, 0, 38, 0.2)', width = 2)))

life_expectancy_vs_percentage_expenditure <- life_expectancy_vs_percentage_expenditure %>%
  layout(title = 'Scatter Plot: Life Expectancy vs Percentage Expenditure',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))

life_expectancy_vs_percentage_expenditure

A similar analysis can be done for the other variables.
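
As a starting point for that, the correlation of every numeric predictor with life expectancy can be computed in one pass. This is a sketch on made-up data (`toy` and its column names are hypothetical); on the real data the response column would be `Life.expectancy`.

```r
# Hypothetical toy data: one response and two predictors
toy <- data.frame(
  life_exp  = c(60, 65, 70, 75, 80),
  schooling = c(8, 9, 11, 12, 14),
  mortality = c(300, 250, 200, 150, 100)
)

# Correlation of each predictor with the response
predictors <- setdiff(names(toy), "life_exp")
cors <- sapply(toy[predictors], function(col) cor(toy$life_exp, col))

# Strongest positive association first
print(sort(cors, decreasing = TRUE))
```

Ranking predictors this way only describes pairwise linear association; it does not replace fitting and checking a model for each variable.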

2. Life expectancy vs Hepatitis B

Hepatitis B represents Hepatitis B (HepB) immunization coverage among 1-year-olds (%)

library(plotly)
# Scatter plot of life expectancy against Hepatitis B immunization coverage
life_expectancy_vs_Hepatitis_B <- plot_ly(
  data = life_expectancy_data,
  x = ~Hepatitis.B, y = ~Life.expectancy,
  type = 'scatter', mode = 'markers',  # set explicitly so plotly does not have to guess
  marker = list(size = 10,
                color = 'rgba(0,255,0, .9)',
                line = list(color = 'rgba(255, 0, 38, 0.2)', width = 2)))

life_expectancy_vs_Hepatitis_B <- life_expectancy_vs_Hepatitis_B %>%
  layout(title = 'Scatter Plot: Life Expectancy vs Hepatitis B',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))

life_expectancy_vs_Hepatitis_B

3. Life expectancy vs Measles

Measles represents the number of reported cases per 1000 population

library(plotly)
# Scatter plot of life expectancy against reported measles cases
life_expectancy_vs_Measles <- plot_ly(
  data = life_expectancy_data,
  x = ~Measles, y = ~Life.expectancy,
  type = 'scatter', mode = 'markers',  # set explicitly so plotly does not have to guess
  marker = list(size = 10,
                color = 'rgba(221,160,221, .9)',
                line = list(color = 'rgba(255, 0, 38, 0.2)', width = 2)))

life_expectancy_vs_Measles <- life_expectancy_vs_Measles %>%
  layout(title = 'Scatter Plot: Life Expectancy vs Measles',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))

life_expectancy_vs_Measles

4. Life expectancy vs BMI

BMI represents average Body Mass Index of entire population

library(plotly)
# Scatter plot of life expectancy against average BMI
life_expectancy_vs_BMI <- plot_ly(
  data = life_expectancy_data,
  x = ~BMI, y = ~Life.expectancy,
  type = 'scatter', mode = 'markers',  # set explicitly so plotly does not have to guess
  marker = list(size = 10,
                color = 'rgba(255,182,193, .9)',
                line = list(color = 'rgba(255, 0, 38, 0.2)', width = 2)))

life_expectancy_vs_BMI <- life_expectancy_vs_BMI %>%
  layout(title = 'Scatter Plot: Life Expectancy vs BMI',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))

life_expectancy_vs_BMI

5. Life expectancy vs under five deaths

Under five deaths represents the number of under-five deaths per 1000 population

library(plotly)
# Scatter plot of life expectancy against under-five deaths
life_expectancy_vs_under_five_deaths <- plot_ly(
  data = life_expectancy_data,
  x = ~under.five.deaths, y = ~Life.expectancy,
  type = 'scatter', mode = 'markers',  # set explicitly so plotly does not have to guess
  marker = list(size = 10,
                color = 'rgba(152,251,152, .9)',
                line = list(color = 'rgba(255, 0, 38, 0.2)', width = 2)))

life_expectancy_vs_under_five_deaths <- life_expectancy_vs_under_five_deaths %>%
  layout(title = 'Scatter Plot: Life Expectancy vs Under five deaths',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))

life_expectancy_vs_under_five_deaths

6. Life expectancy vs Polio

Polio represents Polio (Pol3) immunization coverage among 1-year-olds (%)

library(plotly)
# Scatter plot of life expectancy against the Polio variable
life_expectancy_vs_Polio <- plot_ly(
  data = life_expectancy_data,
  x = ~Polio, y = ~Life.expectancy,
  type = 'scatter', mode = 'markers',  # set explicitly so plotly does not have to guess
  marker = list(size = 10,
                color = 'rgba(255,0,255, .9)',
                line = list(color = 'rgba(255, 0, 38, 0.2)', width = 2)))

life_expectancy_vs_Polio <- life_expectancy_vs_Polio %>%
  layout(title = 'Scatter Plot: Life Expectancy vs Polio',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))

life_expectancy_vs_Polio

7. Life expectancy vs Total expenditure

Total expenditure represents general government expenditure on health as a percentage of total government expenditure (%)

library(plotly)
# Scatter plot of life expectancy against total health expenditure
life_expectancy_vs_Total_expenditure <- plot_ly(
  data = life_expectancy_data,
  x = ~Total.expenditure, y = ~Life.expectancy,
  type = 'scatter', mode = 'markers',  # set explicitly so plotly does not have to guess
  marker = list(size = 10,
                color = 'rgba(30,144,255, .9)',
                line = list(color = 'rgba(255, 0, 38, 0.2)', width = 2)))

life_expectancy_vs_Total_expenditure <- life_expectancy_vs_Total_expenditure %>%
  layout(title = 'Scatter Plot: Life Expectancy vs Total expenditure',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))

life_expectancy_vs_Total_expenditure

8. Life expectancy vs Diphtheria

Diphtheria represents diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)

library(plotly)
# Scatter plot of life expectancy against DTP3 immunization coverage
life_expectancy_vs_Diphtheria <- plot_ly(
  data = life_expectancy_data,
  x = ~Diphtheria, y = ~Life.expectancy,
  type = 'scatter', mode = 'markers',  # set explicitly so plotly does not have to guess
  marker = list(size = 10,
                color = 'rgba(0, 255, 127, .9)',
                line = list(color = 'rgba(255, 0, 38, 0.2)', width = 2)))

life_expectancy_vs_Diphtheria <- life_expectancy_vs_Diphtheria %>%
  layout(title = 'Scatter Plot: Life Expectancy vs Diphtheria',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))

life_expectancy_vs_Diphtheria

9. Life expectancy vs thinness 1 to 19 years

Thinness 1 to 19 years represents the prevalence of thinness among children and adolescents aged 10 to 19 (%). Note that the variable name says 1 to 19 while the definition covers ages 10 to 19.

library(plotly)
# Scatter plot of life expectancy against thinness prevalence (1 to 19 years variable)
life_expectancy_vs_thinness_1_19_years <- plot_ly(
  data = life_expectancy_data,
  x = ~thinness..1.19.years, y = ~Life.expectancy,
  type = 'scatter', mode = 'markers',  # set explicitly so plotly does not have to guess
  marker = list(size = 10,
                color = 'rgba(129, 216, 210, .9)',
                line = list(color = 'rgba(255, 0, 38, 0.2)', width = 2)))

life_expectancy_vs_thinness_1_19_years <- life_expectancy_vs_thinness_1_19_years %>%
  layout(title = 'Scatter Plot: Life Expectancy vs Thinness 1 to 19 years',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))

life_expectancy_vs_thinness_1_19_years

10. Life expectancy vs thinness 5 to 9 years

thinness 5 to 9 years: Prevalence of thinness among children aged 5 to 9 (%)

library(plotly)
# type and mode are set explicitly so plotly does not have to guess them
life_expectancy_vs_thinness_5_9_years <- plot_ly(data = life_expectancy_data, x = ~thinness.5.9.years, y = ~Life.expectancy,
                                                 type = 'scatter', mode = 'markers',
                                                 marker = list(size = 10,
                                                               color = 'rgba(181, 201, 253, .9)',
                                                               line = list(color = 'rgba(255, 0, 38, 0.2)',
                                                                           width = 2)))
life_expectancy_vs_thinness_5_9_years <- life_expectancy_vs_thinness_5_9_years %>% layout(title = 'Scatter Plot: Life Expectancy vs Thinness 5 to 9 years',
                                                                    yaxis = list(zeroline = FALSE),
                                                                    xaxis = list(zeroline = FALSE))

life_expectancy_vs_thinness_5_9_years

11. Life expectancy vs Income composition of resources

Income composition of resources: Human Development Index in terms of income composition of resources (index ranging from 0 to 1)

library(plotly)
# type and mode are set explicitly so plotly does not have to guess them
life_expectancy_vs_Income_composition_of_resources <- plot_ly(data = life_expectancy_data, x = ~Income.composition.of.resources, y = ~Life.expectancy,
                                                              type = 'scatter', mode = 'markers',
                                                              marker = list(size = 10,
                                                                            color = 'rgba(181, 201, 253, .9)',
                                                                            line = list(color = 'rgba(255, 0, 38, 0.2)',
                                                                                        width = 2)))
life_expectancy_vs_Income_composition_of_resources <- life_expectancy_vs_Income_composition_of_resources %>% layout(title = 'Scatter Plot: Life Expectancy vs Income composition of resources',
                                                                    yaxis = list(zeroline = FALSE),
                                                                    xaxis = list(zeroline = FALSE))

life_expectancy_vs_Income_composition_of_resources

12. Life expectancy vs GDP

GDP: Gross Domestic Product per capita (in USD)

library(plotly)
# type and mode are set explicitly so plotly does not have to guess them
life_expectancy_vs_GDP <- plot_ly(data = life_expectancy_data, x = ~GDP, y = ~Life.expectancy,
                                  type = 'scatter', mode = 'markers',
                                  marker = list(size = 10,
                                                color = 'rgba(152, 215, 182, .9)',
                                                line = list(color = 'rgba(255, 0, 38, 0.2)',
                                                            width = 2)))
life_expectancy_vs_GDP <- life_expectancy_vs_GDP %>% layout(title = 'Scatter Plot: Life Expectancy vs GDP',
                                                                    yaxis = list(zeroline = FALSE),
                                                                    xaxis = list(zeroline = FALSE))

life_expectancy_vs_GDP

13. Life expectancy vs Alcohol

Alcohol: Alcohol consumption, recorded per capita (ages 15+), in litres of pure alcohol

library(plotly)
# type and mode are set explicitly so plotly does not have to guess them
life_expectancy_vs_Alcohol <- plot_ly(data = life_expectancy_data, x = ~Alcohol, y = ~Life.expectancy,
                                      type = 'scatter', mode = 'markers',
                                      marker = list(size = 10,
                                                    color = 'rgba(152, 215, 182, .9)',
                                                    line = list(color = 'rgba(0, 0, 0, 0)',
                                                                width = 2)))
life_expectancy_vs_Alcohol <- life_expectancy_vs_Alcohol %>% layout(title = 'Scatter Plot: Life Expectancy vs Alcohol',
                                                                    yaxis = list(zeroline = FALSE),
                                                                    xaxis = list(zeroline = FALSE))

life_expectancy_vs_Alcohol

Multiple Linear Regression

Now that we have seen simple linear regressions relating life expectancy to each independent variable on its own, let’s analyse the data set using multiple linear regression.

Multiple linear regression is an extension of simple linear regression used to predict an outcome variable (y) on the basis of multiple distinct predictor variables (x).

With three predictor variables (x), the prediction of y is expressed by the following equation:

y = b0 + b1*x1 + b2*x2 + b3*x3

The “b” values are called the regression weights (or beta coefficients). They measure the association between the predictor variable and the outcome. “b_j” can be interpreted as the average effect on y of a one unit increase in “x_j”, holding all other predictors fixed.
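
As a minimal sketch of this interpretation, the following simulates data with known coefficients and checks that lm() recovers them. The variable names (x1, x2, x3) and the true coefficient values are made up for illustration and have nothing to do with the life expectancy data.

```r
# Illustrative simulation; x1, x2, x3 and the true coefficients are assumptions.
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + 0.3 * x3 + rnorm(n, sd = 0.5)

sim_fit <- lm(y ~ x1 + x2 + x3)
round(coef(sim_fit), 2)   # estimates should land close to 2, 1.5, -0.8 and 0.3
```

Each estimated coefficient approximates the change in y for a one unit change in its predictor while the other two are held fixed.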

library(tidyverse)


model <- lm(Life.expectancy ~ Alcohol + percentage.expenditure + Hepatitis.B + Measles +  BMI + under.five.deaths + Polio+ Total.expenditure + Diphtheria  + thinness..1.19.years +  thinness.5.9.years + Income.composition.of.resources, data = life_expectancy_data_num)
summary(model)
## 
## Call:
## lm(formula = Life.expectancy ~ Alcohol + percentage.expenditure + 
##     Hepatitis.B + Measles + BMI + under.five.deaths + Polio + 
##     Total.expenditure + Diphtheria + thinness..1.19.years + thinness.5.9.years + 
##     Income.composition.of.resources, data = life_expectancy_data_num)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.1522  -2.5539   0.4574   2.7976  21.8127 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      4.842e+01  7.987e-01  60.627  < 2e-16 ***
## Alcohol                         -3.403e-02  3.628e-02  -0.938 0.348262    
## percentage.expenditure           7.292e-04  7.953e-05   9.169  < 2e-16 ***
## Hepatitis.B                      3.656e-03  6.144e-03   0.595 0.551916    
## Measles                          2.044e-05  1.566e-05   1.305 0.191941    
## BMI                              7.449e-02  7.662e-03   9.722  < 2e-16 ***
## under.five.deaths               -1.000e-03  1.087e-03  -0.920 0.357907    
## Polio                            3.495e-02  7.241e-03   4.828 1.48e-06 ***
## Total.expenditure               -1.576e-02  5.429e-02  -0.290 0.771592    
## Diphtheria                       3.017e-02  8.057e-03   3.744 0.000186 ***
## thinness..1.19.years            -4.666e-02  7.979e-02  -0.585 0.558793    
## thinness.5.9.years              -1.410e-01  7.844e-02  -1.798 0.072391 .  
## Income.composition.of.resources  2.095e+01  8.466e-01  24.749  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.463 on 2070 degrees of freedom
##   (855 observations deleted due to missingness)
## Multiple R-squared:  0.5704, Adjusted R-squared:  0.5679 
## F-statistic:   229 on 12 and 2070 DF,  p-value: < 2.2e-16

The first step in interpreting the multiple regression analysis is to examine the F-statistic and the associated p-value, at the bottom of the model summary.

In our example, the p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that at least one of the predictor variables is significantly related to the outcome variable.
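
The p-value of the overall F-test is not stored directly in the summary object, but it can be recomputed from the F-statistic and its degrees of freedom. The sketch below uses the built-in mtcars data as a stand-in (demo_fit is a hypothetical model, not the one fitted above), and then repeats the calculation with the values printed in the summary above.

```r
# mtcars is a built-in stand-in; the document's model object is not recreated here.
demo_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

f <- summary(demo_fit)$fstatistic   # named vector: value, numdf, dendf
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
unname(p_value)

# Same calculation with the numbers reported above: F = 229 on 12 and 2070 DF.
pf(229, 12, 2070, lower.tail = FALSE) < 2.2e-16
```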

To see which predictor variables are significant, you can examine the coefficients table, which shows the estimates of the regression beta coefficients and the associated t-statistic p-values:

summary(model)$coefficients
##                                      Estimate   Std. Error    t value
## (Intercept)                     48.4212274295 7.986680e-01 60.6274772
## Alcohol                         -0.0340324872 3.627505e-02 -0.9381790
## percentage.expenditure           0.0007291457 7.952571e-05  9.1686792
## Hepatitis.B                      0.0036557413 6.144224e-03  0.5949883
## Measles                          0.0000204357 1.565612e-05  1.3052846
## BMI                              0.0744931387 7.662231e-03  9.7221217
## under.five.deaths               -0.0009999939 1.087463e-03 -0.9195656
## Polio                            0.0349547554 7.240603e-03  4.8276027
## Total.expenditure               -0.0157614840 5.428787e-02 -0.2903316
## Diphtheria                       0.0301699912 8.057180e-03  3.7444851
## thinness..1.19.years            -0.0466556368 7.978999e-02 -0.5847305
## thinness.5.9.years              -0.1410030973 7.844090e-02 -1.7975712
## Income.composition.of.resources 20.9521138649 8.465872e-01 24.7489149
##                                      Pr(>|t|)
## (Intercept)                      0.000000e+00
## Alcohol                          3.482619e-01
## percentage.expenditure           1.121298e-19
## Hepatitis.B                      5.519163e-01
## Measles                          1.919410e-01
## BMI                              7.070793e-22
## under.five.deaths                3.579069e-01
## Polio                            1.482979e-06
## Total.expenditure                7.715916e-01
## Diphtheria                       1.856993e-04
## thinness..1.19.years             5.587927e-01
## thinness.5.9.years               7.239068e-02
## Income.composition.of.resources 1.130122e-118

For a given predictor, the t-statistic evaluates whether or not there is a significant association between the predictor and the outcome variable, that is, whether the beta coefficient of the predictor is significantly different from zero.
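
To make the t-test concrete, the t value and p-value of a single row can be reproduced by hand. The numbers below are copied from the Polio row of the coefficient table above; the residual degrees of freedom (2070) come from the model summary.

```r
# Values copied from the Polio row of the coefficient table above.
estimate  <- 0.0349547554
std_error <- 7.240603e-03
df_resid  <- 2070                                  # residual degrees of freedom

t_value <- estimate / std_error                    # estimate divided by its standard error
p_value <- 2 * pt(-abs(t_value), df = df_resid)    # two-sided p-value
c(t = t_value, p = p_value)
```

The result should agree with the t value and Pr(>|t|) columns for Polio, up to rounding.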

At the 5% significance level, percentage expenditure, BMI, Polio, Diphtheria and Income composition of resources are significantly associated with life expectancy. Alcohol, Hepatitis B, Measles, under-five deaths, Total expenditure and Thinness 1-19 years are not, and Thinness 5-9 years is only marginal (p = 0.072).

For a given predictor variable, the coefficient (b) can be interpreted as the average effect on y of a one unit increase in that predictor, holding all other predictors fixed.

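We now remove Measles, percentage expenditure, Hepatitis B and under-five deaths and refit the model. Of these, Measles, Hepatitis B and under-five deaths are not significant in the coefficient table above. Percentage expenditure is significant (p < 2e-16), so dropping it is a modelling choice rather than something the t-tests support.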

library(tidyverse)


model <- lm(Life.expectancy ~ Alcohol +  BMI +  Polio+ Total.expenditure + Diphtheria  + thinness..1.19.years +  thinness.5.9.years + Income.composition.of.resources, data = life_expectancy_data_num)
summary(model)
## 
## Call:
## lm(formula = Life.expectancy ~ Alcohol + BMI + Polio + Total.expenditure + 
##     Diphtheria + thinness..1.19.years + thinness.5.9.years + 
##     Income.composition.of.resources, data = life_expectancy_data_num)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.8980  -2.9211   0.1913   2.8549  24.8361 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     45.318748   0.677932  66.849  < 2e-16 ***
## Alcohol                          0.033620   0.034374   0.978    0.328    
## BMI                              0.093951   0.007609  12.348  < 2e-16 ***
## Polio                            0.046490   0.007006   6.635 3.94e-11 ***
## Total.expenditure                0.016001   0.052026   0.308    0.758    
## Diphtheria                       0.045517   0.006991   6.511 8.97e-11 ***
## thinness..1.19.years            -0.114326   0.074353  -1.538    0.124    
## thinness.5.9.years              -0.094107   0.072858  -1.292    0.197    
## Income.composition.of.resources 21.599486   0.727366  29.695  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.86 on 2547 degrees of freedom
##   (382 observations deleted due to missingness)
## Multiple R-squared:  0.6124, Adjusted R-squared:  0.6112 
## F-statistic:   503 on 8 and 2547 DF,  p-value: < 2.2e-16
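
Note that the two fits above do not use the same rows: 855 observations were dropped for missingness in the first model and 382 in the second, so their R-squared values are not directly comparable. One way to compare nested models, sketched here on the built-in airquality data as a stand-in, is to restrict both fits to the same complete cases and run a partial F-test with anova(). This is a suggested check under those assumptions, not a step the analysis above performed.

```r
# airquality is a built-in data set with missing values, used as a stand-in.
vars <- c("Ozone", "Solar.R", "Wind", "Temp")
cc   <- airquality[complete.cases(airquality[, vars]), ]   # same rows for both fits

full_fit    <- lm(Ozone ~ Solar.R + Wind + Temp, data = cc)
reduced_fit <- lm(Ozone ~ Wind + Temp, data = cc)

nobs(full_fit) == nobs(reduced_fit)   # both models now use identical observations
anova(reduced_fit, full_fit)          # partial F-test for the dropped predictor
```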

Finally, our model can be written as follows:

Life_Expectancy = 45.318748 + 0.033620 x Alcohol + 0.093951 x BMI + 0.046490 x Polio + 0.016001 x Total.expenditure + 0.045517 x Diphtheria - 0.114326 x thinness..1.19.years - 0.094107 x thinness.5.9.years + 21.599486 x Income.composition.of.resources
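
As a rough illustration of how this equation produces a prediction, the sketch below plugs in the coefficients from the reduced model summary above together with hypothetical predictor values for a single country-year. The input values are invented for the example and do not come from the data set.

```r
# Coefficients copied from the reduced model summary above.
b <- c(intercept = 45.318748, Alcohol = 0.033620, BMI = 0.093951,
       Polio = 0.046490, Total.expenditure = 0.016001, Diphtheria = 0.045517,
       thinness..1.19.years = -0.114326, thinness.5.9.years = -0.094107,
       Income.composition.of.resources = 21.599486)

# Hypothetical predictor values (the leading 1 multiplies the intercept).
x <- c(1, Alcohol = 5, BMI = 40, Polio = 90, Total.expenditure = 6,
       Diphtheria = 90, thinness..1.19.years = 4, thinness.5.9.years = 4,
       Income.composition.of.resources = 0.7)

predicted <- sum(b * x)   # intercept plus the sum of coefficient * value
predicted                 # roughly 71.9 years for these inputs
```

With the fitted model object available, predict(model, newdata = ...) should give the same value up to rounding of the coefficients.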

The confidence intervals of the model coefficients can be extracted as follows:

confint(model)
##                                       2.5 %      97.5 %
## (Intercept)                     43.98939399 46.64810222
## Alcohol                         -0.03378338  0.10102314
## BMI                              0.07903126  0.10887137
## Polio                            0.03275133  0.06022859
## Total.expenditure               -0.08601684  0.11801788
## Diphtheria                       0.03180879  0.05922563
## thinness..1.19.years            -0.26012391  0.03147274
## thinness.5.9.years              -0.23697353  0.04875936
## Income.composition.of.resources 20.17319681 23.02577503
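
These intervals are the estimate plus or minus a t quantile times the standard error. A short sketch reproducing the BMI interval by hand, using the estimate, standard error and residual degrees of freedom printed in the reduced model summary above:

```r
# BMI row of the reduced model summary above.
estimate  <- 0.093951
std_error <- 0.007609
df_resid  <- 2547

t_crit <- qt(0.975, df = df_resid)                # close to 1.96 for large df
ci <- estimate + c(-1, 1) * t_crit * std_error    # lower and upper 95% limits
ci   # should match the BMI row of confint(model) up to rounding
```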

Model accuracy assessment

As we have seen in simple linear regression, the overall quality of the model can be assessed by examining the R-squared (R2) and Residual Standard Error (RSE).

R-squared:

In multiple linear regression, R2 is the square of the correlation coefficient R between the observed values of the outcome variable (y) and the fitted (i.e., predicted) values of y. R is never negative here, so both R and R2 range from zero to one.

R2 represents the proportion of variance, in the outcome variable y, that may be predicted by knowing the value of the x variables. An R2 value close to 1 indicates that the model explains a large portion of the variance in the outcome variable.
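
For a least-squares fit with an intercept, R2 equals the squared correlation between observed and fitted values. A quick check on the built-in mtcars data (demo_fit is a hypothetical stand-in model, not the one fitted above):

```r
# mtcars is a built-in stand-in for the life expectancy data.
demo_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

r <- cor(mtcars$mpg, fitted(demo_fit))        # correlation of observed vs fitted
all.equal(r^2, summary(demo_fit)$r.squared)   # squared correlation equals R2
```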

A problem with R2 is that it never decreases when more variables are added to the model, even if those variables are only weakly associated with the response (James et al. 2014). A solution is to adjust R2 by taking into account the number of predictor variables.

The adjustment in the “Adjusted R Square” value in the summary output is a correction for the number of x variables included in the prediction model.
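
The usual adjustment is 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number of rows used and p the number of predictors. The sketch below applies it to the values printed for the reduced model above; n is inferred from the residual degrees of freedom.

```r
r2 <- 0.6124          # Multiple R-squared of the reduced model
p  <- 8               # number of predictors
n  <- 2547 + p + 1    # residual df plus estimated coefficients = 2556 rows used

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)      # should agree with the reported Adjusted R-squared (0.6112)
```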

Residual Standard Error (RSE), or sigma:

The RSE estimate gives a measure of error of prediction. The lower the RSE, the more accurate the model (on the data in hand).

The error rate can be estimated by dividing the RSE by the mean of the outcome variable. Here the RSE of 5.86 corresponds to an error rate of about 8.5%:

sigma(model)/mean(life_expectancy_data$Life.expectancy)
## [1] 0.08465579
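
For reference, the RSE is the square root of the residual sum of squares divided by the residual degrees of freedom. The sketch below checks this against sigma() on the built-in mtcars data (a stand-in, not the model above) and computes the error rate using only the rows that entered the fit, which can differ from the full data when rows were dropped for missing values.

```r
# mtcars is a built-in stand-in for the life expectancy data.
demo_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

rse <- sqrt(sum(residuals(demo_fit)^2) / df.residual(demo_fit))
all.equal(rse, sigma(demo_fit))          # same value as sigma()

# Mean of the response over the rows actually used in the fit.
used_mean <- mean(model.frame(demo_fit)$mpg)
rse / used_mean                          # relative error rate
```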
